Results 1 - 14 of 14
1.
Sci Rep ; 7(1): 728, 2017 04 07.
Article in English | MEDLINE | ID: mdl-28389642

ABSTRACT

We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented. For this purpose, we developed new procedures to identify symmetric word pairs with uncommon empirical distance distribution and with clusters of overrepresented short distances. We speculate that patterns of overrepresentation of short distances between symmetric word pairs may allow the occurrence of non-standard DNA conformations, such as hairpin/cruciform structures. We focused on the human genome, and analysed both the complete genome as well as a version with known repetitive sequences masked out. We reported several well-defined features in the distributions of distances, which can be classified into three different profiles, showing enrichment in distinct distance ranges. We analysed in greater detail certain pairs of symmetric words of length seven, found by our procedure, characterised by the surprising fact that they occur at single distances more frequently than expected.


Subjects
DNA, Human Genome, Genomics, DNA Sequence Analysis, Algorithms, Human Chromosomes, DNA/chemistry, DNA/genetics, Genetic Databases, Genomics/methods, Humans, Markov Chains, Genetic Models, Nucleic Acid Conformation, DNA Sequence Analysis/methods, Structure-Activity Relationship
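
The notion of a symmetric word pair in record 1 can be made concrete with a small sketch. It is not the authors' procedure: the function names are hypothetical, and measuring the distance from each occurrence of a word to the nearest following occurrence of its reversed complement is an assumption about how "distance" is defined.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reversed complement of a DNA word, e.g. AAG -> CTT."""
    return word.translate(COMPLEMENT)[::-1]

def find_all(seq, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    i = seq.find(word)
    while i != -1:
        positions.append(i)
        i = seq.find(word, i + 1)
    return positions

def symmetric_pair_distances(seq, word):
    """Histogram of distances from each occurrence of `word` to the nearest
    following occurrence of its reversed complement (assumed definition)."""
    rc_positions = find_all(seq, reverse_complement(word))
    hist = Counter()
    for p in find_all(seq, word):
        later = [q for q in rc_positions if q > p]
        if later:
            hist[later[0] - p] += 1
    return hist
```

Comparing such a histogram with the one expected under a background model would be the next step; that comparison is left out here.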
2.
Clin Sci (Lond) ; 129(12): 1163-72, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26432088

ABSTRACT

Baroreceptor reflex sensitivity (BRS) is an important prognostic factor because a reduced BRS has been associated with an adverse cardiovascular outcome. The threshold for a 'reduced' BRS was established by the ATRAMI study at BRS <3 ms/mmHg in patients with a previous myocardial infarction, and has been shown to improve risk assessment in many other cardiac dysfunctions. The successful application of this cut-off to other populations suggests that it may reflect an inherent property of baroreflex functioning, so our goal is to investigate whether it represents a 'natural' partition of BRS values. As reduced baroreflex responsiveness is also associated with ageing, we investigated whether a BRS estimate <3 ms/mmHg could be the result of a process of physiological senescence as well as a sign of BRS dysfunction. This study involved 228 chronic heart failure patients and 60 age-matched controls. Our novel method combined transfer function BRS estimation and automatic clustering of BRS probability distributions, to define indicative levels of different BRS activities. The analysis produced a fit clustering (cophenetic correlation coefficient 0.9 out of 1) and identified one group of homogeneous patients (well separated from the others by 3 ms/mmHg) with an increased BRS-based mortality risk [hazard ratio (HR): 3.19 (1.73, 5.89), P<0.001]. The age-dependent BRS cut-off, estimated by 5% quantile regression of log (BRS) with age (considering the age-matched controls), provides a similar mortality value [HR: 2.44 (1.37, 4.43), P=0.003]. In conclusion, the 3 ms/mmHg cut-off identifies two large clusters of homogeneous heart failure (HF) patients, thus supporting the hypothesis of a natural cut-off in the HF population. Furthermore, age was found to have no statistical impact on risk assessment, suggesting that there is no need to establish age-based cut-offs because 3 ms/mmHg optimally identifies patients at high mortality risk.


Subjects
Baroreflex, Blood Pressure, Heart Failure/physiopathology, Heart Rate, Adult, Age Factors, Aged, Chronic Disease, Cluster Analysis, Female, Heart Failure/diagnosis, Heart Failure/mortality, Humans, Kaplan-Meier Estimate, Male, Middle Aged, Predictive Value of Tests, Prognosis, Proportional Hazards Models, Retrospective Studies, Risk Assessment, Risk Factors
3.
Sci Rep ; 5: 10203, 2015 May 18.
Article in English | MEDLINE | ID: mdl-25984837

ABSTRACT

Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.


Subjects
Computational Biology/methods, Gene Rearrangement, Genomics/methods, Algorithms, Animals, Humans, Web Browser
4.
Bioinformatics ; 31(15): 2421-5, 2015 Aug 01.
Article in English | MEDLINE | ID: mdl-25840045

ABSTRACT

MOTIVATION: Ebola virus causes high mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available, thus, novel diagnosis tools and druggable targets are needed. RESULTS: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences with lengths between 12 and 14. Only three absent sequences of length 12 exist and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used is able to identify pathogen-specific signatures for quick and precise action against infectious agents, of which the current Ebola virus outbreak provides a compelling example.


Subjects
Viral DNA/chemistry, Ebolavirus/genetics, Disease Outbreaks, Human Genome, Viral Genome, Ebola Virus Disease/epidemiology, Ebola Virus Disease/virology, Humans, DNA Sequence Analysis, Viral Proteins/genetics
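
The idea in record 4 of finding the shortest words present in a pathogen genome but absent from a host genome can be sketched as a brute-force set difference. This is an illustration only: the function names are hypothetical, only the given strand is scanned (an assumption), and genome-scale data would need a far more memory-efficient index than Python sets.

```python
def kmers(seq, k):
    """Set of all words of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shortest_specific_words(target, background, max_k=14):
    """Smallest k (up to max_k) for which some word of length k occurs in
    `target` but not in `background`, together with those words."""
    for k in range(1, max_k + 1):
        specific = kmers(target, k) - kmers(background, k)
        if specific:
            return k, sorted(specific)
    return None, []
```

For example, `shortest_specific_words("ACGTG", "ACGTACGA")` finds that every single nucleotide of the target also occurs in the background, while the dinucleotide `TG` does not.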
5.
PLoS One ; 8(11): e79922, 2013.
Article in English | MEDLINE | ID: mdl-24278218

ABSTRACT

Data summarization and triage is one of the current top challenges in visual analytics. The goal is to let users visually inspect large data sets and examine or request data with particular characteristics. The need for summarization and visual analytics is also felt when dealing with digital representations of DNA sequences. Genomic data sets are growing rapidly, making their analysis increasingly more difficult, and raising the need for new, scalable tools. For example, being able to look at very large DNA sequences while immediately identifying potentially interesting regions would provide the biologist with a flexible exploratory and analytical tool. In this paper we present a new concept, the "information profile", which provides a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. The computation of the information profiles is computationally tractable: we show that it can be done in time proportional to the length of the sequence. We also describe a tool to compute the information profiles of a given DNA sequence, and use the genome of the fission yeast Schizosaccharomyces pombe strain 972 h(-) and five human chromosomes 22 for illustration. We show that information profiles are useful for detecting large-scale genomic regularities by visual inspection. Several discovery strategies are possible, including the standalone analysis of single sequences, the comparative analysis of sequences from individuals from the same species, and the comparative analysis of sequences from different organisms. The comparison scale can be varied, allowing the users to zoom-in on specific details, or obtain a broad overview of a long segment. Software applications have been made available for non-commercial use at http://bioinformatics.ua.pt/software/dna-at-glance.


Subjects
Fungal DNA/genetics, DNA/genetics, DNA Sequence Analysis/methods, Fungal Chromosomes, Human Chromosomes, Humans, Schizosaccharomyces/genetics, Telomere
6.
J Theor Biol ; 335: 153-9, 2013 Oct 21.
Article in English | MEDLINE | ID: mdl-23831271

ABSTRACT

Previous studies have suggested that Chargaff's second rule may hold for relatively long words (above 10 nucleotides), but this has not been conclusively shown. In particular, the following questions remain open: Is the phenomenon of symmetry statistically significant? If so, what is the word length above which significance is lost? Can deviations in symmetry due to the finite size of the data be identified? This work addresses these questions by studying word symmetries in the human genome, chromosomes and transcriptome. To rule out finite-length effects, the results are compared with those obtained from random control sequences built to satisfy Chargaff's second parity rule. We use several techniques to evaluate the phenomenon of symmetry, including Pearson's correlation coefficient, total variational distance, a novel word symmetry distance, as well as traditional and equivalence statistical tests. We conclude that word symmetries are statistically significant in the human genome for word lengths up to 6 nucleotides. For longer words, we present evidence that the phenomenon may not be as prevalent as previously thought.


Subjects
Human Chromosomes/genetics, Human Genome/physiology, Genetic Models, Human Chromosomes/metabolism, Humans, Transcriptome/physiology
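
One of the measures named in record 6, Pearson's correlation between the counts of words and the counts of their reversed complements, can be sketched as follows. The function names are hypothetical and the sketch covers only this one measure, not the statistical tests of the paper; it assumes the sequence contains only A, C, G and T.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def symmetry_correlation(seq, k):
    """Pearson correlation between the count of each word of length k and
    the count of its reversed complement, measured on a single strand."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    x = [counts[w] for w in words]
    y = [counts[reverse_complement(w)] for w in words]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # all counts equal: correlation undefined
    return cov / sqrt(var_x * var_y)
```

A value close to 1 is what Chargaff's second parity rule would predict for the chosen word length; a sequence built as a string followed by its own reversed complement gives exactly 1 for k = 1.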
7.
J Integr Bioinform ; 8(3): 172, 2011 Sep 15.
Article in English | MEDLINE | ID: mdl-21926435

ABSTRACT

We study the inter-dinucleotide distance distributions in the human genome, both in the whole-genome and protein-coding regions. The inter-dinucleotide distance is defined as the distance to the next occurrence of the same dinucleotide. We consider the 16 sequences of inter-dinucleotide distances and two reading frames. Our results show a period-3 oscillation in the protein-coding inter-dinucleotide distance distributions that is absent from the whole-genome distributions. We also compare the distance distribution of each dinucleotide to a reference distribution, that of a random sequence generated with the same dinucleotide abundances, revealing the CG dinucleotide as the one with the highest cumulative relative error for the first 60 distances. Moreover, the distance distribution of each dinucleotide is compared to the distance distribution of all other dinucleotides using the Kullback-Leibler divergence. We find that the distance distribution of a dinucleotide and that of its reversed complement are very similar, hence, the divergence between them is very small. This is an interesting finding that may give evidence of a stronger parity rule than Chargaff's second parity rule.


Subjects
Genetic Variation/physiology, Human Genome/physiology, Reading Frames/physiology, DNA Sequence Analysis/methods, Animals, Humans
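
The inter-dinucleotide distance defined in record 7 (the distance to the next occurrence of the same dinucleotide) can be computed in one pass. The sketch below is an illustration with a hypothetical function name; it counts overlapping dinucleotides at every position, which is an assumption, and does not reproduce the two reading frames considered in the paper.

```python
from collections import Counter, defaultdict

def inter_dinucleotide_distances(seq):
    """For each dinucleotide, a histogram of the distances between
    consecutive occurrences of that same dinucleotide."""
    last_seen = {}
    hist = defaultdict(Counter)
    for i in range(len(seq) - 1):
        di = seq[i:i + 2]
        if di in last_seen:
            hist[di][i - last_seen[di]] += 1
        last_seen[di] = i
    return hist
```

Normalising each histogram gives the empirical distance distributions that the abstract compares against a random reference and against each other.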
8.
PLoS One ; 6(6): e21588, 2011.
Article in English | MEDLINE | ID: mdl-21738720

ABSTRACT

A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), contrasting with the statistical models underlying other methods, where the extensive data repetitions in DNA sequences are explored, and which therefore have a non-local character.


Subjects
Computational Biology/methods, Genome/genetics, Animals, Humans, Markov Chains
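
The measure described in record 8, the negative logarithm of the probability that an order-k finite-context model assigns to each symbol, can be sketched with a single adaptive model. This is a simplification under stated assumptions: one model rather than several competing ones, no inverted-repeat handling, an additive estimator with a hypothetical parameter `alpha`, and input restricted to the symbols A, C, G and T.

```python
from collections import defaultdict
from math import log2

def information_profile(seq, k, alpha=1.0):
    """Bits needed for each symbol seq[k:], under an adaptive order-k
    finite-context model with additive smoothing (alpha per symbol)."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    profile = []
    for i in range(k, len(seq)):
        context = counts[seq[i - k:i]]
        p = (context[seq[i]] + alpha) / (sum(context.values()) + 4 * alpha)
        profile.append(-log2(p))
        context[seq[i]] += 1  # update the counts only after coding the symbol
    return profile
```

The mean of the returned list is the average number of bits per base. A model that has learned nothing yet spends 2 bits on a symbol, and highly repetitive input drives the average well below that.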
9.
PLoS One ; 6(1): e16065, 2011 Jan 31.
Article in English | MEDLINE | ID: mdl-21386877

ABSTRACT

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We investigate whether the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, as well as in other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry from the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.


Subjects
Eukaryotic Cells/metabolism, Genome/genetics, Prokaryotic Cells/metabolism, Animals, Base Composition/genetics, Base Sequence, Inheritance Patterns/genetics, Molecular Sequence Data, Nucleotides/genetics, Phylogeny, Vertebrates/genetics
10.
J Theor Biol ; 275(1): 52-8, 2011 Apr 21.
Article in English | MEDLINE | ID: mdl-21295040

ABSTRACT

DNA may be represented by sequences of four symbols, but it is often useful to convert those symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but most of them seem to be unrelated to any intrinsic characteristic of DNA. The objective of this work was to study a mapping scheme that is directly related to DNA characteristics, and that could be useful in discriminating between different species. Recently, we have proposed a methodology based on the inter-nucleotide distance, which proved to contribute to the discrimination among species. In this paper, we introduce a new distance, the distance to the nearest dissimilar nucleotide, which is the distance of a nucleotide to the first occurrence of a different nucleotide. This distance is related to the repetition structure of single nucleotides. Using the information resulting from the concatenation of the distance to the nearest dissimilar and the inter-nucleotide distance, we found that this new distance brings additional discriminative capabilities. This suggests that the distance to the nearest dissimilar nucleotide might contribute useful information about the evolution of the species.


Subjects
Genome/genetics, Genetic Models, Nucleotides/genetics, Animals, Base Sequence, Humans, Molecular Sequence Data, Phylogeny, Sequence Alignment, Species Specificity
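
The distance to the nearest dissimilar nucleotide introduced in record 10 can be computed with one backward pass over the sequence. The sketch uses a hypothetical function name and assumes that the distance is measured forwards, from each position to the first later position holding a different nucleotide.

```python
def nearest_dissimilar_distances(seq):
    """For each position, the distance to the first following position that
    holds a different nucleotide, or None if no such position exists."""
    n = len(seq)
    dist = [None] * n
    for i in range(n - 2, -1, -1):
        if seq[i + 1] != seq[i]:
            dist[i] = 1
        elif dist[i + 1] is not None:
            dist[i] = dist[i + 1] + 1  # extend the run of identical symbols
    return dist
```

For `"AAACCG"` this returns `[3, 2, 1, 2, 1, None]`: long runs of a single nucleotide show up as large values at the start of the run.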
11.
Bioinformatics ; 25(23): 3064-70, 2009 Dec 01.
Article in English | MEDLINE | ID: mdl-19759198

ABSTRACT

MOTIVATION: DNA sequences can be represented by sequences of four symbols, but it is often useful to convert the symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but they seem unrelated to any intrinsic characteristic of DNA. The objective of this work was to find a mapping scheme directly related to DNA characteristics and that would be useful in discriminating between different species. Mathematical models to explore DNA correlation structures may contribute to a better knowledge of the DNA and to find a concise DNA description. RESULTS: We developed a methodology to process DNA sequences based on inter-nucleotide distances. Our main contribution is a method to obtain genomic signatures for complete genomes, based on the inter-nucleotide distances, that are able to discriminate between different species. Using these signatures and hierarchical clustering, it is possible to build phylogenetic trees. Phylogenetic trees lead to genome differentiation and allow the inference of phylogenetic relations. The phylogenetic trees generated in this work display related species close to each other, suggesting that the inter-nucleotide distances are able to capture essential information about the genomes. To create the genomic signature, we construct a vector which describes the inter-nucleotide distance distribution of a complete genome and compare it with the reference distance distribution, which is the distribution of a sequence where the nucleotides are placed randomly and independently. It is the residual or relative error between the data and the reference distribution that is used to compare the DNA sequences of different organisms.


Subjects
DNA/chemistry, Genome, Genomics/methods, Nucleotides/chemistry, DNA Sequence Analysis/methods, Algorithms, Base Sequence, Phylogeny
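
The signature construction in record 11 compares an observed inter-nucleotide distance distribution with that of a sequence whose nucleotides are placed randomly and independently. For such a sequence, the distance to the next occurrence of a nucleotide with frequency p is geometric, P(d) = p(1-p)^(d-1). The sketch below follows that reading; the function names are hypothetical, and it assumes 0 < p < 1 for the chosen nucleotide.

```python
from collections import Counter

def inter_nucleotide_distribution(seq, base):
    """Empirical distribution of distances between consecutive occurrences
    of `base` in seq."""
    positions = [i for i, b in enumerate(seq) if b == base]
    gaps = Counter(b - a for a, b in zip(positions, positions[1:]))
    total = sum(gaps.values())
    return {d: c / total for d, c in gaps.items()}

def reference_distribution(p, max_d):
    """Geometric reference: distances in a random independent sequence in
    which the nucleotide has probability p at each position."""
    return {d: p * (1 - p) ** (d - 1) for d in range(1, max_d + 1)}

def relative_error(seq, base, max_d):
    """Relative error between observed and reference distributions for
    distances 1..max_d (assumes 0 < p < 1)."""
    observed = inter_nucleotide_distribution(seq, base)
    reference = reference_distribution(seq.count(base) / len(seq), max_d)
    return [(observed.get(d, 0.0) - reference[d]) / reference[d]
            for d in range(1, max_d + 1)]
```

Concatenating these vectors for the four nucleotides would give one candidate signature for hierarchical clustering; the exact vector layout used in the paper is not stated in the abstract.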
12.
BMC Bioinformatics ; 10: 137, 2009 May 08.
Article in English | MEDLINE | ID: mdl-19426495

ABSTRACT

BACKGROUND: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones. RESULTS: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws. CONCLUSION: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.


Subjects
Algorithms, Base Sequence, DNA/chemistry, Genomics/methods, DNA Sequence Analysis/methods, Nucleic Acid Databases
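
The definition in record 12 (an absent word whose leftmost-trimmed and rightmost-trimmed versions both occur in the data) can be checked directly by brute force. The sketch below is not the near-linear algorithm of the paper; the function names are hypothetical, and it enumerates only words of length 2 up to `max_len`.

```python
def substrings(seq, k):
    """Set of all words of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Words w (2 <= len(w) <= max_len) that do not occur in seq although
    both w[:-1] and w[1:] do occur."""
    result = []
    for k in range(2, max_len + 1):
        present_k = substrings(seq, k)
        present_shorter = substrings(seq, k - 1)
        for left in present_shorter:      # w[:-1] occurs by construction
            for c in alphabet:
                w = left + c
                if w not in present_k and w[1:] in present_shorter:
                    result.append(w)
    return sorted(result)
```

On the toy input `"AACA"` over the alphabet `"AC"`, the minimal absent words up to length 3 are `AAA`, `CAA`, `CAC` and `CC`; `ACC` is absent but not minimal, because `CC` is already absent.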
13.
IEEE Trans Biomed Eng ; 53(11): 2148-55, 2006 Nov.
Article in English | MEDLINE | ID: mdl-17073319

ABSTRACT

It is known that the protein-coding regions of DNA are usually characterized by a three-base periodicity. In this paper, we exploit this property, studying a DNA model based on three deterministic states, where each state implements a finite-context model. The experimental results obtained confirm the appropriateness of the proposed approach, showing compression gains in relation to the single finite-context model counterpart. Additionally, and potentially more interesting than the compression gain on its own, is the observation that the entropy associated with each of the three base positions of a codon differs and that this variation is not the same among the organisms analyzed.


Subjects
Algorithms, DNA/genetics, Genetic Models, Open Reading Frames/genetics, Proteins/genetics, Sequence Alignment/methods, DNA Sequence Analysis/methods, Base Sequence, Computer Simulation, Molecular Sequence Data
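
The observation in record 13 that entropy differs across the three base positions of a codon can be illustrated with the simplest possible model per state. The sketch below uses an order-0 empirical entropy for each position, which is a deliberate simplification of the finite-context models in the paper; the function name is hypothetical and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter
from math import log2

def codon_position_entropy(coding_seq):
    """Order-0 empirical entropy (bits per base) of the nucleotides found
    at codon positions 1, 2 and 3 of an in-frame coding sequence."""
    entropies = []
    for phase in range(3):
        counts = Counter(coding_seq[phase::3])
        total = sum(counts.values())
        entropies.append(-sum(c / total * log2(c / total)
                              for c in counts.values()))
    return entropies
```

In the toy sequence `"AACAGCATCACC"` the first and third positions are constant (0 bits) while the second takes all four values equally often (2 bits).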
14.
Phys Rev E Stat Nonlin Soft Matter Phys ; 70(3 Pt 1): 031910, 2004 Sep.
Article in English | MEDLINE | ID: mdl-15524552

ABSTRACT

This paper explores the connection between the size of the spectral coefficients of a nucleotide or any other symbolic sequence and the distribution of nucleotides along certain subsequences. It explains the connection between the nucleotide distribution and the size of the spectral coefficients, and gives a necessary and sufficient condition for a coefficient to have a prescribed magnitude. Furthermore, it gives a fast algorithm for computing the value of a given spectral coefficient of a nucleotide sequence, discussing periods 3 and 4 as examples. Finally, it shows that the spectrum of a symbolic sequence is redundant, in the sense that there exists a linear recursion that determines the values of all the coefficients from those of a subset.


Subjects
Algorithms, Base Sequence, DNA Mutational Analysis/methods, DNA/analysis, DNA/chemistry, Computer-Assisted Numerical Analysis, DNA Sequence Analysis/methods, Gene Frequency, Molecular Sequence Data
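
The link in record 14 between a spectral coefficient and the distribution of a nucleotide along subsequences can be shown for period 3. Writing u[n] = 1 where the nucleotide occurs and c_r for its count at positions n ≡ r (mod 3), the sum of u[n]·exp(-2πi·n/3) over n equals c_0 + c_1·exp(-2πi/3) + c_2·exp(-4πi/3). The sketch below uses this identity; the function name is hypothetical and this is not claimed to be the paper's algorithm. The value coincides with a DFT coefficient only when the sequence length is a multiple of 3.

```python
import cmath

def period3_coefficient(seq, base):
    """Fourier sum at frequency 1/3 of the indicator sequence of `base`,
    computed from the three counts of `base` at positions 0, 1, 2 mod 3."""
    counts = [seq[r::3].count(base) for r in range(3)]
    return sum(c * cmath.exp(-2j * cmath.pi * r / 3)
               for r, c in enumerate(counts))
```

The magnitude is zero when the nucleotide is spread evenly over the three residue classes and grows as it concentrates in one of them.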